1. Introduction

Philadelphia’s Indego bikeshare program began operating in 2015, though feasibility work dates back to a Philadelphia City Council resolution in June 2008. Today the program offers over 1,400 bikes at more than 140 stations, serving students, commuters, tourists, and everyday riders on a daily basis. While dockless bikeshare programs have spread around the world, Indego is an example of a system built on docking stations. One of the key challenges for a dock-based bikeshare program is “re-balancing”: moving bicycles to stations that are anticipated to have higher demand. Stations on the outer edges of the city see higher demand during the morning commute peak, while stations downtown and in employment centers see higher demand during the evening rush hour. Operationally, the ability to predict this demand in both time and space is crucial for a successful bikesharing system.

This analysis uses Indego ridership data from June 2019 to predict both the temporal and spatial demand for Indego bikes in Philadelphia. Using ride-specific data (time, pass type, etc.) and weather data, the analysis focuses on building a model that predicts demand as accurately as possible. An accurate model could then be used by Indego to deploy re-docking truck drivers economically and efficiently, ensuring high-demand stations can meet demand. I will also test the generalizability of the model to better understand whether it generalizes across all Indego stations.

2. Data

2.1 Loading Data

For this analysis, I will be using Indego’s ridership data from June 2019. Initially, I also wanted to use June 2020 data to explore the difference in ridership between the two years. However, I ran into issues with the date format used in the 2020 data, so I will only be looking at the 2019 data.
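Though I did not pursue it here, one possible way to handle a file with a different or inconsistent timestamp format would be to try several candidate formats when parsing. This is a hypothetical base-R sketch; the formats shown are illustrative, not necessarily the ones in the 2020 file.

```r
# Try each candidate format in turn, filling in whatever is still unparsed.
# The candidate formats below are assumptions for illustration only.
parse_mixed <- function(x, formats = c("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M")) {
  out <- as.POSIXct(rep(NA_character_, length(x)), tz = "UTC")
  for (fmt in formats) {
    still_na <- is.na(out)
    out[still_na] <- as.POSIXct(x[still_na], format = fmt, tz = "UTC")
  }
  out
}

# Two strings in two different formats both parse to valid timestamps
parse_mixed(c("2020-06-01 08:15:00", "6/1/2020 8:15"))
```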

  • Indego June 2019 ridership data
  • Census data from ACS 2018 using tidycensus package
  • Weather data using riem package
# Load Philly Indego ridership data
indego_2019_q2 <- read_csv("indego-trips-2019-q2.csv") %>%
  filter(str_detect(start_time,"2019-06")) %>%
  tidyr::drop_na()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   trip_id = col_double(),
##   duration = col_double(),
##   start_time = col_datetime(format = ""),
##   end_time = col_datetime(format = ""),
##   start_station = col_double(),
##   start_lat = col_double(),
##   start_lon = col_double(),
##   end_station = col_double(),
##   end_lat = col_double(),
##   end_lon = col_double(),
##   bike_id = col_character(),
##   plan_duration = col_double(),
##   trip_route_category = col_character(),
##   passholder_type = col_character(),
##   bike_type = col_character()
## )
#indego_2020_q2 <- read_csv("indego-trips-2020-q2.csv")
#setting time bins
indego_2019_q2 <- indego_2019_q2 %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))
selected_vars <- c("B02001_001E", # Total population
                   "B02001_002E", # People describing themselves as "white alone"
                   "B02001_003E", # People describing themselves as "black" or "african-american" alone
                   "B15001_050E", # Females with bachelors degrees
                   "B15001_009E", # Males with bachelors degrees
                   "B19013_001E", # Median HH income
                   "B25058_001E", # Median rent
                   "B06012_002E", # Total poverty
                   "B08301_001E", # People who have means of transportation to work
                   "B08301_002E", # Total people who commute by car, truck, or van
                   "B08301_010E", # Total people who commute by public transportation
                   "B03002_012E", # Estimate Total Hispanic or Latino by race
                   "B19326_001E", # Median income in past 12 months (inflation-adjusted)
                   "B07013_001E", # Total households
                   "B08013_001E", # Aggregate travel time to work
                   "B01002_001E") # Median age

phila_census <- 
  get_acs(geography = "tract", 
          variables = selected_vars, 
          year=2018, 
          state="PA",
          county = c("Philadelphia"),
          geometry=T, 
          output="wide") %>%
  #st_transform('ESRI:102658') %>%
  rename(Med_Age = B01002_001E,
         TotalPop = B02001_001E, 
         Whites = B02001_002E,
         Blacks = B02001_003E,
         FemaleBachelors = B15001_050E, 
         MaleBachelors = B15001_009E,
         MedHHInc = B19013_001E,
         TotalPoverty = B06012_002E,
         TotalCommute = B08301_001E,
         CarCommute = B08301_002E,
         PubCommute = B08301_010E,
         TotalHispanic = B03002_012E,
         MedInc = B19326_001E,
         TotalHH = B07013_001E) %>%
  dplyr::select(-NAME, -starts_with("B0"), -starts_with("B1"), -starts_with("B2")) %>%
  mutate(pctWhite = (ifelse(TotalPop > 0, Whites / TotalPop,0))*100,
         pctBlack = (ifelse(TotalPop > 0, Blacks / TotalPop,0))*100,
         pctHis = (ifelse(TotalPop >0, TotalHispanic/TotalPop,0))*100,
         pctBachelors = (ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop),0)) *100,
         pctPoverty = (ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0))*100,
         pctCarCommute = (ifelse(TotalCommute > 0, CarCommute / TotalCommute,0))*100,
         pctPubCommute = (ifelse(TotalCommute > 0, PubCommute / TotalCommute,0))*100,
         year = "2018") %>%
  mutate(MedHHInc = replace_na(MedHHInc, 0),
         pctBachelors= replace_na(pctBachelors,0),
         pctHis= replace_na(pctHis,0),
         pctCarCommute= replace_na(pctCarCommute,0)) %>%
  dplyr::select(-Whites, -Blacks, -FemaleBachelors, -MaleBachelors, -TotalPoverty, -CarCommute, -PubCommute, -TotalCommute, -TotalHispanic)

#extract_geometries
phila_tracts <- 
  phila_census %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf
dat <- st_join(indego_2019_q2 %>% 
                        filter(is.na(start_lon) == FALSE &
                                 is.na(start_lat) == FALSE &
                                 is.na(end_lon) == FALSE &
                                 is.na(end_lat) == FALSE) %>%
                        st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                      phila_tracts %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phila_tracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)
weather <- 
  riem_measures(station = "PHL", date_start = "2019-06-01", date_end = "2019-06-30") %>%
  dplyr::select(valid, tmpf, p01i, sknt, relh)%>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt),
            Humidity = max(relh)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

2.2 Exploring data

2.2.1 Weather data

The figure below shows the precipitation, wind speed, temperature, and humidity readings during June 2019. The temperature graph shows that the latter half of the month was relatively warmer than the first half. June was chosen because its weather is generally pleasant, so temperature should not push ridership far out of the ordinary.

grid.arrange(
  ggplot(weather, aes(interval60, Precipitation)) + geom_line(aes(color = palette1)) + 
    labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme + theme(legend.position = "none"),
  ggplot(weather, aes(interval60, Wind_Speed)) + geom_line(aes(color = palette1)) + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme + theme(legend.position = "none"),
  ggplot(weather, aes(interval60, Temperature)) + geom_line(aes(color = palette1)) + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme + theme(legend.position = "none"),
  ggplot(weather, aes(interval60, Humidity)) + geom_line(aes(color = palette1)) + 
    labs(title="Humidity", x="Hour", y="Humidity") + plotTheme  + theme(legend.position = "none"),
  top="Weather Data - Philadelphia - June, 2019")

2.2.2 Overall Usage Patterns

The graph below shows the overall usage patterns for the study period. You can see that patterns are relatively similar from week to week, although the week of June 17 has lower peaks in general. This may be explained by the precipitation graph in the previous figure, which shows a few days of heavy rain during that week.

ggplot(dat %>%
         group_by(interval60) %>%
         tally()) +
  geom_line(aes(x = interval60, y = n, color="palette1")) +
  labs(title="Bike share trips per hr. Philadelphia, June 2019",
       x="Date", 
       y="Number of trips") +
  plotTheme +
  theme(legend.position = "none")

2.2.3 Time of Day Distributions

The following set of graphs shows the average number of trips seen by each station. The data have been put into four categories: AM Rush (7-10am), Mid-Day (10am-3pm), PM Rush (3-7pm), and Overnight (7pm-7am). Most stations see 1 or 2 trips during the AM Rush, but a noticeable number of stations see 3 to 5 or more, producing a long tail. Stations during the PM Rush share similar characteristics with the AM Rush, also showing a long tail. The Mid-Day and Overnight periods, on the other hand, show most stations seeing 1-2 rentals.

dat %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))%>%
  group_by(interval60, start_station, time_of_day) %>%
  tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips, fill= palette1), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, June, 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme +
  theme(legend.position = "none")

2.2.4 Weekday vs. Weekend

We are also interested in understanding trip characteristics by day of the week. The first graph shows the trip count distributions by hour of the day. You can easily notice the morning and afternoon peaks from Monday through Friday, while Saturday and Sunday peak during mid-day. The second graph groups weekdays and weekends together to show these patterns more clearly.

ggplot(dat %>% mutate(hour = hour(start_time)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia, by day of the week, June, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme

ggplot(dat %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia - weekend vs weekday, June, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme 

Now that we have established that weekday and weekend peaks differ, are the stations seeing high trip volumes located in the same places? The following series of maps shows the trip counts of each station in the given time periods, with weekday trips in the first row and weekend trips in the second. In the weekday AM Rush maps, the stations with higher trip counts congregate outside the Center City area, while the PM Rush map shows the opposite, with the darker dots (representing stations with heavy usage) located in Center City. This suggests that Indego users are commuters living in more residential areas, such as the Fairmount neighborhood to the north and the South Philly neighborhoods, riding bikes in to their workplaces in the morning and back out to residential areas during the PM Rush.

The weekend trip characteristics show that the heaviest-used stations during the weekend peak (Mid-Day) are located near the Fairmount neighborhood, closer to the museums. I can hypothesize that these are residents and tourists using Indego bikes to visit the museums and restaurants in the area.

ggplot()+
  geom_sf(data = phila_tracts %>%
            st_transform(crs=4326))+
  geom_point(data = dat %>% 
               mutate(hour = hour(start_time),
                      weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))%>%
               group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.75, size = 1.5)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat$start_lat), max(dat$start_lat))+
  xlim(min(dat$start_lon), max(dat$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Philadelphia, June 2019")+
  mapTheme

3. Feature Engineering

3.1 Space and Time

The following code block shows the process I took to create a space-time panel containing an observation for every possible space and time combination.

length(unique(dat$interval60)) * length(unique(dat$start_station))
## [1] 96078
study.panel <- 
  expand.grid(interval60=unique(dat$interval60), 
              start_station = unique(dat$start_station)) %>%
  left_join(., dat %>%
              select(start_station, Origin.Tract, start_lon, start_lat, passholder_type, trip_route_category) %>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

nrow(study.panel)      
## [1] 96078
ride.panel <- 
  dat %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat, passholder_type, trip_route_category) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, phila_census %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

3.2 Creating time lags

In this section, I create time lag features to help capture bike demand at specific hours of the day. The following code creates six separate time lag features:

  • lagHour
  • lag2Hours
  • lag3Hours
  • lag4Hours
  • lag12Hours
  • lag1day
ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))

The table below shows the correlation value for each of the time lag features. lagHour has the strongest correlation, r = 0.81. The value decreases as the lag moves further from the time of the trip, with lag12Hours showing a negative correlation of r = -0.45. This result makes sense: trip activity at a bike station at 9am and at 9pm will not be correlated. The correlation for lag1day climbs back up close to that of lagHour, showing that demand at the same hour on the next day is similar.

timelags <-
    as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))
timelag_cortable <- 
    group_by(timelags, Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"),2)) %>%
    kable(caption = "Time Lag features Correlations", ) %>%
    kable_styling("striped")

timelag_cortable
Time Lag Feature Correlations

Variable      correlation
lagHour          0.81
lag2Hours        0.58
lag3Hours        0.37
lag4Hours        0.19
lag12Hours      -0.45
lag1day          0.79
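The sign pattern in this table can be reproduced with a toy example (not the Indego data): for a sinusoidal daily demand cycle, a 1-hour lag is strongly positively correlated with the series, a 12-hour lag (half a cycle) is strongly negatively correlated, and a 24-hour lag lines up with the same hour the next day.

```r
# Toy illustration: one week of hourly "demand" following a 24-hour cycle
hours  <- 0:167
demand <- sin(2 * pi * hours / 24)

# Correlation between a series and its k-hour lag
lag_cor <- function(x, k) {
  n <- length(x)
  cor(x[(k + 1):n], x[1:(n - k)])
}

lag_cor(demand, 1)    # close to +1
lag_cor(demand, 12)   # close to -1 (opposite side of the daily cycle)
lag_cor(demand, 24)   # close to +1 (same hour, next day)
```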

3.3 Trip Count Animations

We have now looked at trip count characteristics by day of the week and by the time buckets created earlier. The following animations show hourly trip counts on Monday, June 17, 2019 and Saturday, June 22, 2019, giving a good picture of how trips are distributed throughout the day across station locations.

# animation for a monday and saturday 
weekday_mon <-
  filter(dat, week == 24 & dotw == "Mon")

weekday_mon_panel <-
  expand.grid(
    interval60 = unique(weekday_mon$interval60),
    start_station = unique(dat$start_station))

animation_dat <-
  mutate(weekday_mon, Trip_Counter = 1) %>%
    right_join(weekday_mon_panel) %>% 
    group_by(interval60, start_station, start_lat, start_lon) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 9 ~ "7-9 trips",
                             Trip_Count > 9 ~ "10+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-9 trips","10+ trips"))

count_animation <-
  ggplot() +
    geom_sf(data = phila_census, color = "grey", fill = "transparent") +
    geom_point(data = animation_dat, aes(x = start_lon, y = start_lat, color = Trips), size=1)+
    scale_colour_manual(values = palette5) +
    labs(title = "Philadelphia Indego Bikeshare Demand for Monday, June 17, 2019 by Station",
         subtitle = "1 Hour Intervals: {current_frame}") + mapTheme +
    transition_manual(interval60)

animate(count_animation, duration=20, renderer = gifski_renderer())

# animation by bikeshare station
weekday_sat <-
  filter(dat, week == 25 & dotw == "Sat")

weekday_sat_panel <-
  expand.grid(
    interval60 = unique(weekday_sat$interval60),
    start_station = unique(dat$start_station))

animation_dat2 <-
  mutate(weekday_sat, Trip_Counter = 1) %>%
    right_join(weekday_sat_panel) %>% 
    group_by(interval60, start_station, start_lat, start_lon) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 9 ~ "7-9 trips",
                             Trip_Count > 9 ~ "10+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-9 trips","10+ trips"))

count_animation2 <-
  ggplot() +
    geom_sf(data = phila_census, color = "grey", fill = "transparent") +
    geom_point(data = animation_dat2, aes(x = start_lon, y = start_lat, color = Trips), size=1)+
    scale_colour_manual(values = palette5) +
    labs(title = "Philadelphia Indego Bikeshare Demand for Saturday, June 22, 2019 by Station",
         subtitle = "1 Hour Intervals: {current_frame}") + mapTheme +
    transition_manual(interval60)

animate(count_animation2, duration=20, renderer = gifski_renderer())

4. Models

After exploring the datasets and engineering new features, a series of models is created to find the one that predicts best with the least error. The dataset is first split into a training set, ride.Train, containing the latter part of June (weeks 24 and later), and a testing set, ride.Test, containing the first two weeks.

ride.Train <- filter(ride.panel, week >= 24)
ride.Test <- filter(ride.panel, week < 24)
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation + passholder_type + trip_route_category,
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation + passholder_type + trip_route_category +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day, 
     data=ride.Train)

4.1 Predict for test data

Next, a nested dataframe is created using the purrr package in order to organize the data by week. purrr allows looping over the nested dataframe, making comparisons across the different model specifications straightforward.

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(A = map(.x = data, fit = reg1, .f = model_pred),
           B = map(.x = data, fit = reg2, .f = model_pred),
           C = map(.x = data, fit = reg3, .f = model_pred),
           D = map(.x = data, fit = reg4, .f = model_pred),
           E = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

4.2 Error Metrics

The figure below shows the mean absolute errors (MAE) of the five models by week. Model E, which includes the time lag variables, shows the smallest MAE of the five. Model E’s variables are:

  • start_station: station of where the trip was originated
  • hour(interval60): hour interval variable
  • dotw: day of the week
  • Temperature: temperature
  • Precipitation: precipitation
  • passholder_type: type of passes used in the trip
  • trip_route_category: type of one way or round trip trips
  • lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day: time lag variables
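For reference, the mean absolute error computed above is the standard definition, averaged over the n station-hour observations in each test week:

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \mathrm{Observed}_i - \mathrm{Prediction}_i \right|
```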
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

The series of line graphs below shows the accuracy of the five models in predicting actual observed ridership. Model E’s predictions are the closest to the observed values in the test set. One thing to note: while it predicts better than the other models, Model E still fails to reach the peaks of the observed values, indicating there are other factors or features the model is missing.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed Philadelphia's Indego bike share time series", subtitle = "A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

To understand which stations the model isn’t predicting well, the MAE value for each station is plotted. You can see that the stations around Center City tend to have higher MAE values. This is problematic because Model E predicts worst at the stations with the most demand, making it difficult for Indego operators to allocate re-balancing resources.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "E") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = phila_census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.7)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat$start_lat), max(dat$start_lat))+
  xlim(min(dat$start_lon), max(dat$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme

4.3 Space-Time Error Evaluation

A series of error plots helps show whether there is any space-time pattern in where Model E predicts better or worse across the time periods created earlier. Interpreting the plots, Model E underpredicts in all time periods, though it may predict the weekday Mid-Day and Overnight periods better than the others.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "E")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

Plotting the MAE on a map helps visualize which stations see the most error, meaning where the model is failing the most. The weekday AM Rush map shows that stations just outside Center City, where commuters pick up bikes, have higher MAEs. The weekday PM Rush map shows the opposite, with higher MAEs at Center City locations. During weekends, station MAEs are highest during Mid-Day. This is once again problematic because the model is underpredicting at the locations with higher demand.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "E")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phila_census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.7, alpha = 0.7)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat$start_lat), max(dat$start_lat))+
  xlim(min(dat$start_lon), max(dat$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

Focusing on trips during the AM Rush, the following figures show the errors as a function of the chosen socio-economic variables: MedHHInc, pctPubCommute, pctPoverty, and pctWhite. MedHHInc and pctWhite appear positively correlated with the errors, meaning the model predicts better for stations in tracts with lower median incomes and lower white population shares.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw),
           pctPubCommute = map(data, pull, pctPubCommute),
           MedHHInc = map(data, pull, MedHHInc),
           pctWhite = map(data, pull, pctWhite),
           pctPoverty = map(data, pull, pctPoverty)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw, pctPubCommute, MedHHInc, pctWhite, pctPoverty) %>%
    unnest() %>%
  filter(Regression == "E")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, pctPubCommute, MedHHInc, pctWhite, pctPoverty) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="AM Rush Hour Trip Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

5. Cross-Validation

Random 100-fold cross-validation is conducted to test the generalizability of Model E.

library(caret)
fitControl <- trainControl(method = "cv", 
                           number = 100,
                           savePredictions = TRUE)

set.seed(3000)
# for k-folds CV

reg.cv <-  
  train(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation + passholder_type + trip_route_category +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day, 
        data = ride.panel,  
        method = "lm",  
        trControl = fitControl,  
        na.action = na.pass)

The graphs below show the results across the 100 folds. The statistics cluster fairly tightly together, which indicates that Model E may be generalizable to new data. The relatively normal histogram of MAE also suggests that the model would generalize to new data, with some error.

reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  ylim(0,1) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(
    legend.position = "none"
  )

reg.cv.dataframe<-
reg.cv$resample %>% 
  dplyr::select(MAE) %>%
  data.frame()

ggplot(data = reg.cv.dataframe, aes(MAE)) +
    geom_histogram() +
    labs(title="Distribution of Mean Absolute Error from 100-Fold Cross-Validation",
        x="Mean Absolute Error",
        y="Frequency") +
  plotTheme

6. Conclusion

Overall, the model performs reasonably well when generalized across all bike stations, as seen in the cross-validation process. However, it underpredicts peaks in bike demand at high-demand locations: during the AM Rush it underpredicts at stations in residential areas outside Center City, and during the PM Rush it underpredicts demand at Center City stations. If this model were used as-is, it could present issues for Indego administrators in re-balancing bikes to meet riders’ needs during different time periods. That result would be detrimental not only to Indego’s finances but also to the City’s ability to provide multimodal options for Philadelphians. The model can be further improved with additional feature engineering; being able to predict demand at Center City locations during the PM Rush is key, as those stations see the biggest demand during weekday evening commutes.